Abstract. Countless variants of the Lempel-Ziv compression are widely
used in many real-life applications. This paper is concerned with a natural
modification of the classical pattern matching problem inspired by the
popularity of such compression methods: given an uncompressed pattern
s[1 . . m] and a Lempel-Ziv representation of a string t[1 . . N], does s
occur in t? Farach and Thorup [6] gave a randomized $O(n \log^2 \frac{N}{n} + m)$ time
solution for this problem, where n is the size of the compressed representation
of t. Building on the methods of [4] and [7], we improve their result
by developing a faster and fully deterministic $O(n \log \frac{N}{n} + m)$ time
algorithm with the same space complexity. Note that for highly compressible
texts, $\log \frac{N}{n}$ might be of order n, so for such inputs the improvement is
very significant. A (tiny) fragment of our method can be used to give
an asymptotically optimal solution for the substring hashing problem
considered by Farach and Muthukrishnan [5].
Key-words: pattern matching, compression, Lempel-Ziv 



1 Introduction 

Effective compression methods allow us to decrease the space requirements, which
is clearly worth pursuing on its own. On the other hand, we do not want to
store the data just for the sake of having it: we want to process it efficiently on
demand. This suggests an interesting direction: can we process the data without
actually decompressing it? Or, in other words, can we speed up processing if
the compression ratio is high? The answer to such questions clearly depends on the
particular compression and processing method chosen. In this paper we focus
on Lempel-Ziv (also known as LZ77, or simply LZ for the sake of brevity), one
of the most commonly used compression methods and the basis of the widely
popular zip and gz archive file formats, and on pattern matching, one of the most
natural text processing problems we might encounter. More specifically, we deal
with the compressed pattern matching problem: given an uncompressed pattern
s[1 . . m] and an LZ representation of a string t[1 . . N], does s occur in t? This line
of research has been addressed a few times before. Amir, Benson,
and Farach [1] considered the problem with LZ replaced by Lempel-Ziv-Welch (a
simpler and easier to implement specialization of LZ), giving two solutions with
complexities $O(n \log m + m)$ and $O(n + m^2)$, where n is the size of the compressed
representation. The latter was soon improved [13] to $O(n + m^{1+\epsilon})$. Then
Farach and Thorup [6] considered the problem in its full generality and gave a
(randomized) $O(n \log^2 \frac{N}{n} + m)$ time algorithm for the LZ case. Their solution
consists of two phases, called winding and unwinding: the first one uses a cleverly
chosen potential function, and the second one adds fingerprinting in the spirit
of the string hashing of Karp and Rabin [10]. While a recent result of [5] shows that
the winding can be performed in just $O(n \log \frac{N}{n})$, it is not clear how to use it to
improve the whole running time (or remove randomization). In this paper we take
a completely different approach, and manage to develop an $O(n \log \frac{N}{n} + m)$ time
algorithm. This complements our recent result from SODA'11 [7] showing that
in the case of Lempel-Ziv-Welch, compressed pattern matching can be solved in
optimal linear time. The space usage of the improved algorithm is the same as
in the solution of Farach and Thorup, $O(n \log \frac{N}{n} + m)$.

Besides the algorithm of Farach and Thorup, the only other result that can
be applied to the LZ case we are aware of is the work of Kida et al. [11]. They
considered the so-called collage systems, which capture many existing compression
schemes, and developed an efficient pattern matching algorithm for
them. While it does not apply directly to the LZ compression, we can transform
an LZ parse into a non-truncating collage system with a slight increase in the size,
see section §. The running time (and space usage) of the resulting algorithm is
$O(n \log \frac{N}{n} + m^2)$. While $m^2$ might be acceptable from a practical point of view,
removing the quadratic dependency on the pattern length seems to be a nontrivial
and fascinating challenge from a more theoretical angle, especially given
that for some highly compressible texts n might be much smaller than m. Citing
[11], even decreasing the dependency to $m^{1.5} \log m$ (the best preprocessing
complexity known for the LZW case [13] at the time) "is a challenging problem".

While we were not able to achieve linear time for the general LZ case, the
algorithm developed in this paper not only significantly improves the previously
known time bounds, but also is fully deterministic and (relatively) simple.
Moreover, LZ compression allows for an exponential decrease in the size of the
compressed text, while in LZW n is at least $\sqrt{N}$. In order to deal with such
highly compressible texts efficiently we need to combine quite a few different
ideas, and the nonlinear time of our (and the previously known) solution might
be viewed as evidence that LZ is substantially more difficult to deal with than
LZW. While most of those ideas are simple, they are very carefully chosen and
composed in order to guarantee the $O(n \log \frac{N}{n} + m)$ running time. We believe
the simplicity of those basic building blocks should not be viewed as a drawback.
On the contrary, it seems to us that improving a previously known result
(which used fairly complicated techniques) by a careful combination of simple
tools should be seen as an advantage. We also argue that in a certain sense, our
result is the best possible: if integer division is not allowed, our algorithm can
be implemented in $O(n \log N + m)$ time, and this is the best time possible.
2 Overview of the algorithm

Our goal is to detect an occurrence of s in a given Lempel-Ziv compressed text
t[1 . . N]. The Lempel-Ziv representation is quite difficult to work with efficiently,
even for such a simple task as extracting a single letter. The starting point of
our algorithm is thus transforming the input into a straight-line program, which
is a context-free grammar with each nonterminal generating exactly one string.
For that we use the method of Charikar et al. [4] to construct an SLP of size
$O(n \log \frac{N}{n})$ with the additional property that all productions are balanced, meaning
that the right sides are of the form XY with
$\frac{\alpha}{1-\alpha} \le \frac{|X|}{|Y|} \le \frac{1-\alpha}{\alpha}$ for some constant
$\alpha$, where |X| is the length of the (unique) string generated by X. Note that
Rytter gave a much simpler algorithm [15] with the same size guarantee, using
the so-called AVL grammars, but we need the grammar to be balanced. We also
need to add a small modification to allow self-referential LZ.
After transforming the text into a balanced SLP, for each nonterminal we try
to check if the string it represents occurs inside s, and if so, compute the position
of any of its occurrences. Otherwise we would like to compute the longest prefix
(suffix) of this string which is a suffix (prefix) of s. At first glance this might
seem like a different problem than the one we promised to solve: instead of locating
an occurrence of the pattern in the text, we retrieve the positions of fragments
of the text in the pattern. Nevertheless, solving it efficiently gives us enough
information to answer the original question due to a constant time procedure
which detects an occurrence of s in a concatenation of two of its substrings.

The first (simple) algorithm for processing a balanced SLP we develop requires
as much as $O(\log m)$ time per query, which results in $O(n \log \frac{N}{n} \log m + m)$
total complexity. This is clearly not enough to beat [6] on all possible inputs.
Hence instead of performing the computation for each nonterminal separately,
we try to process them in $O(\log N)$ groups corresponding to the (truncated)
logarithm of their length. Using the fact that the grammar is balanced, we are then
able to achieve $O(n \log \frac{N}{n} + m \log m)$ time. Because of some technical difficulties,
in order to decrease this complexity we cannot really afford to check if the represented
string occurs in s for each nonterminal exactly, though. Nevertheless, we
can compute some approximation of this information, and by using a tailored
variant of binary search applied to all nonterminals in a single group at once,
we manage to process the whole grammar in time proportional to its size while
adding just $O(m)$ to the running time.

3 Preliminaries 

The computational model we are going to use is the standard RAM allowing
direct and indirect addressing, addition, subtraction, integer division and conditional
jump with word size $w \ge \max\{\log n, \log N\}$. One usually allows multiplication
as well in this model but we do not need it, and the only place where we
use integer division (which in some cases is known to significantly increase the
computational power) is the proof of Lemma 8.

We do not assume that any other operation (like, for example, taking logarithms)
can be performed in constant time on arbitrary words of size w. Nevertheless,
because of the n addend in the final running time, we can afford to
preprocess the results on words of size log n and hence assume that some additional
(reasonable) operations can be performed in constant time on such inputs.

As usual, |w| stands for the length of w, and w[i . . j] refers to its fragment of
length j - i + 1 beginning at the i-th character, where characters are numbered
starting from 1. All strings are over an alphabet $\Sigma$ of polynomial cardinality,
namely $\Sigma = \{1, 2, \ldots, (n + m)^c\}$. A border of w[1 . . |w|] is a fragment which is
both a prefix and a suffix of w, i.e., w[1 . . i] = w[|w| - i + 1 . . |w|]. We identify
such a fragment with its length and say that $border(t) = \{i_1, \ldots, i_k\}$ is the set
of all borders of t. A period of a string w[1 . . |w|] is an integer p such that
w[i] = w[i + p] for all $1 \le i \le |w| - p$. Note that p is a period of w iff |w| - p is a
border. The following lemma is a well-known property of periods.

Lemma 1 (Periodicity lemma). If p and q are both periods of w, and
$p + q \le |w| + \gcd(p, q)$, then gcd(p, q) is a period as well.
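
The definitions of borders and periods can be checked on a small example. The following is a minimal sketch, not part of the paper's algorithm: the function names `borders` and `is_period` and the sample word are illustrative assumptions. It computes all borders with the standard Knuth-Morris-Pratt failure table and lists the periods of the word.

```python
from math import gcd


def borders(w):
    """Lengths of all borders of w, ascending (0 and len(w) included)."""
    n = len(w)
    fail = [0] * (n + 1)  # fail[i] = length of the longest proper border of w[:i]
    k = 0
    for i in range(1, n):
        while k > 0 and w[i] != w[k]:
            k = fail[k]
        if w[i] == w[k]:
            k += 1
        fail[i + 1] = k
    out = []
    b = n
    while b > 0:  # follow the chain of ever shorter borders
        out.append(b)
        b = fail[b]
    out.append(0)
    return out[::-1]


def is_period(w, p):
    """p is a period of w if w[i] == w[i + p] wherever both sides exist."""
    return all(w[i] == w[i + p] for i in range(len(w) - p))


w = "abcabcab"  # hypothetical sample word
periods = [p for p in range(1, len(w) + 1) if is_period(w, p)]
```

On this word every period p corresponds to the border |w| - p, and any two periods satisfying the hypothesis of Lemma 1 have a gcd that is again a period.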

The Lempel-Ziv representation of a string t[1 . . N] is a sequence of triples
$(start_i, len_i, next_i)$ for i = 1, 2, . . . , n, where n is the size of the representation,
$start_i$ and $len_i$ are nonnegative integers, and $next_i \in \Sigma$. Such a triple
refers to a fragment of the text $t[start_i . . start_i + len_i - 1]$ and defines
$t[1 + \sum_{j<i} len_j . . \sum_{j \le i} len_j] = t[start_i . . start_i + len_i - 1]\, next_i$. We require that
$start_i \le \sum_{j<i} len_j$ if $len_i > 0$. The representation is not self-referential if all fragments
we are referring to are already defined, i.e., $start_i + len_i - 1 \le \sum_{j<i} len_j$
for all i. The sequence of triples is often called the LZ parse of the text.
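
To make the role of self-reference concrete, here is a small decoding sketch. It is an illustration under stated assumptions rather than the paper's construction: the function name `lz_decode` is hypothetical, positions are 1-based, and each triple is taken to copy `length` characters starting at `start` and then append `next_char`. Copying one character at a time is what lets a fragment overlap the part of the text that it is itself defining.

```python
def lz_decode(triples):
    """Decode a list of (start, length, next_char) triples into a string."""
    out = []
    for start, length, next_char in triples:
        if length > 0 and not 1 <= start <= len(out):
            raise ValueError("a fragment must start inside the decoded prefix")
        for k in range(length):
            # out grows while we copy, so start - 1 + k is always a valid index,
            # even when the fragment runs past the prefix decoded so far
            out.append(out[start - 1 + k])
        out.append(next_char)
    return "".join(out)


plain = lz_decode([(0, 0, "a"), (0, 0, "b"), (1, 2, "c")])
self_ref = lz_decode([(0, 0, "a"), (1, 3, "b")])  # copies 3 letters from a 1-letter prefix
```

The second parse is self-referential: its fragment starts at position 1 but is longer than the single letter decoded before it.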

A straight-line program is a context-free grammar in the Chomsky normal form
such that the nonterminals $X_1, X_2, \ldots, X_s$ can be ordered in such a way that
each $X_i$ occurs exactly once as a left side, and whenever $X_i \to X_j X_k$ it holds
that j, k < i. We identify each nonterminal with the unique string it derives, so
|X| stands for the length of the string derived from X. We call a straight-line
program (SLP) balanced if for each production $X \to YZ$ both |Y| and |Z| are
bounded by a constant fraction of |X|.
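
As an illustration of these definitions, the sketch below stores an SLP as a Python list in which entry i is either a single character (a terminal rule) or a pair (j, k) with j, k < i. This encoding, the function names and the sample grammar are assumptions made for the example only. It computes the lengths bottom-up, checks the balance condition for a given $\alpha$, and extracts a single character by walking down from a nonterminal.

```python
def slp_lengths(rules):
    """Length of the string derived by each nonterminal, computed bottom-up."""
    lengths = []
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            lengths.append(1)
        else:
            j, k = rule
            assert j < i and k < i, "productions may only use earlier nonterminals"
            lengths.append(lengths[j] + lengths[k])
    return lengths


def is_balanced(rules, lengths, alpha=0.25):
    """Check alpha/(1-alpha) <= |Y|/|Z| <= (1-alpha)/alpha for every X -> YZ."""
    lo, hi = alpha / (1 - alpha), (1 - alpha) / alpha
    return all(
        lo <= lengths[r[0]] / lengths[r[1]] <= hi
        for r in rules
        if not isinstance(r, str)
    )


def slp_char(rules, lengths, x, pos):
    """Character at 1-based position pos of the string derived by nonterminal x."""
    while not isinstance(rules[x], str):
        j, k = rules[x]
        if pos <= lengths[j]:
            x = j
        else:
            pos -= lengths[j]
            x = k
    return rules[x]


# Hypothetical sample grammar deriving "abaab" from its last nonterminal.
rules = ["a", "b", (0, 1), (2, 0), (3, 2)]
lengths = slp_lengths(rules)
```

In a balanced grammar the walk in `slp_char` takes a number of steps logarithmic in the length of the derived string, which is one reason the balanced form is convenient.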

We preprocess the pattern s using standard tools (suffix trees [TS] built for s 
and reversed s, and LCA queries 0) to get the following primitives. 

Lemma 2. Pattern s can be preprocessed in linear time so that given i,j,k 
representing any two fragments s[i . . i + k] and s[j . . j + k] we can find their 
longest common prefix (suffix) in constant time. 

Lemma 3. Pattern s can be preprocessed in linear time so that given any frag- 
ment s[i . . j] we can find its longest suffix (prefix) which is a prefix (suffix) of 
the whole pattern in constant time, assuming we know the (explicit or implicit) 
vertex corresponding to s[i . . j] in the suffix tree built for s (reversed s). 

Proof. We assume that the suffix tree is built for s concatenated with a special 
terminating character, say $. Each leaf in the suffix tree corresponds to some 
suffix of s, and is connected to its parent with an edge labeled with a single
letter. If we mark all those parents, finding the longest prefix which is a suffix of
the whole s reduces to finding the lowest marked vertex on a given path leading
to the root, which can be precomputed for all vertices in linear time. □

We will also use the suffix array SA built for s [5]. For each suffix of s we store
its position inside SA, and treat the array as a sequence of strings rather than
a permutation of {1, 2, . . . , |s|}. Given any word w, we will say that it occurs
at position i in the SA if w begins s[SA[i] . . |s|]. Similarly, the fragment of SA
corresponding to w is the (maximal) range of entries at which w occurs.
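
The notion of the fragment of SA corresponding to a word can be illustrated with a naive sketch. The quadratic construction by sorting and the names `suffix_array` and `sa_range` are assumptions for this example; the paper relies on a linear-time construction instead. Two binary searches locate the maximal range of entries whose suffixes begin with w.

```python
def suffix_array(s):
    """1-based starting positions of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def sa_range(s, sa, w):
    """Half-open range (lo, hi) of 0-based SA indices at which w occurs."""

    def head(idx):
        # the suffix at SA index idx, truncated to the length of w
        return s[sa[idx] - 1:][:len(w)]

    lo, hi = 0, len(sa)
    while lo < hi:  # first entry whose truncated suffix is >= w
        mid = (lo + hi) // 2
        if head(mid) < w:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first entry whose truncated suffix is > w
        mid = (lo + hi) // 2
        if head(mid) <= w:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


s = "banana"  # hypothetical sample pattern
sa = suffix_array(s)
```

An empty range (lo equal to hi) means that w does not occur in s at all.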

4 Snippets toolbox 

In this section we develop a few efficient procedures operating on fragments of 
the pattern, which we call snippets: 

Definition 1. A snippet is a substring of the pattern s[i . . j]. If i = 1 we call it 
a prefix snippet, if j = m a suffix snippet. 

We identify snippets with the substrings they represent, and use |s| to denote
the length of the string represented by s. A snippet is stored as a pair (i, j).

The two results of this section that we are going to use later build heavily on
the contents of [7]. Specifically, Lemma 6 appears there as Lemma 5. To prove it,
we first need the following simple and relatively well known property of borders.

Lemma 4. If the longest border of t is of length $b \ge \frac{|t|}{2}$ then all borders of
length at least $\frac{|t|}{2}$ create one arithmetic progression. More specifically,
$border(t) \cap \{\frac{|t|}{2}, \ldots, |t|\} = \{|t| - \alpha p : 0 \le \alpha \le \frac{|t|}{2p}\}$, where p = |t| - b is the period of t. We
call this set the long borders of t.
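
The statement of Lemma 4 can be verified by brute force on small words. The sketch below is only an illustration: the helper names are hypothetical and the borders are computed naively in quadratic time. It lists the long borders directly and compares them with the arithmetic progression $|t| - \alpha p$.

```python
def borders_naive(t):
    """All border lengths of t (0 and len(t) included), by direct comparison."""
    n = len(t)
    return [b for b in range(n + 1) if t[:b] == t[n - b:]]


def long_borders(t):
    """Borders of length at least |t| / 2."""
    return [b for b in borders_naive(t) if 2 * b >= len(t)]


def lemma4_progression(t):
    """The progression {|t| - a*p} of Lemma 4, or None if its hypothesis fails.

    Assumes t is nonempty.
    """
    n = len(t)
    b = max(x for x in borders_naive(t) if x < n)  # longest proper border
    if 2 * b < n:
        return None
    p = n - b  # the period of t
    return sorted(n - a * p for a in range(n // (2 * p) + 1))
```

For example, "abababab" has period 2 and long borders 4, 6, 8, while "aabaabaa" has period 3 and long borders 5, 8; in both cases the two functions agree.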

By applying the preprocessing from the Knuth-Morris-Pratt algorithm to s
and its reverse $s^r$ we can extract borders of prefix and suffix snippets efficiently.

Lemma 5. Pattern s can be preprocessed in linear time so that we can find the
longest border of each of its prefixes (suffixes) in constant time.

The first result tells how to detect an occurrence in a concatenation of two 
snippets. We will perform a lot of such operations. 

Lemma 6 (Lemma 5 of [7]). Given a prefix snippet and a suffix snippet we
can detect an occurrence of the pattern in their concatenation in constant time. 

Proof. We need to answer the following question: does s occur in s[1 . . i]s[j . . m]?
Or, in other words, is there $x \in border(s[1 . . i])$ and $y \in border(s[j . . m])$ such
that x + y = m? Note that either $x \ge \frac{m}{2}$ or $y \ge \frac{m}{2}$, and without loss of
generality assume the former. From Lemma 4 we know that all such possible
values of x create one arithmetic progression. More specifically, $x = i - \alpha p$, where
$p \le \frac{i}{2}$ is the period of s[1 . . i] extracted using Lemma 5. We need to check if
there is an occurrence of s in s[1 . . i]s[j . . m] starting after the $\alpha p$-th character,
for some $0 \le \alpha \le \frac{i}{2p}$. For any such possible interesting shift, there will be no
mismatch in s[1 . . i]. There might be a mismatch in s[j . . m], though.

Let $k \ge i$ be the longest prefix of s for which p is a period (such k can be
calculated efficiently by looking up the longest common prefix of s[p + 1 . . m]
and the whole s). We shift s[1 . . k] by $\alpha p$ characters, where $\alpha p$ is the
maximal shift of this form which, after extending s[1 . . k] to the whole s, does
not result in sticking out of the right end of s[j . . m]. Then we compute the leftmost
mismatch of the shifted s[1 . . k] with s[j . . m], see Figure 1. The position of the first
mismatch, or its nonexistence, allows us to eliminate all but one interesting shift.
More precisely, we have two cases to consider.

1. There is no mismatch. If k = m we are done, otherwise $s[k + 1] \ne s[k + 1 - p]$,
meaning that choosing any smaller interesting shift results in a mismatch.

2. There is a mismatch. Let the conflicting characters be a and b and call the
position at which a occurs in the concatenation the obstacle. Observe that
we must choose a shift $\alpha p$ so that s[1 . . k] shifted by $\alpha p$ is completely on
the left of the obstacle. On the other hand, if s[1 . . k] shifted by $(\alpha + 1)p$ is
completely on the left as well, shifting s[1 . . k] by $\alpha p$ results in a mismatch
because $s[k + 1] \ne s[k + 1 - p]$ and $s[k + 1 - p]$ matches with the corresponding
character in s[j . . m]. Thus we may restrict our attention to the largest shift
for which s[1 . . k] is on the left of the obstacle.

Having identified the only interesting shift, we verify if there is a match using
one longest common prefix query on s. More precisely, if the shift is $\alpha p$, we check
if the common prefix of $s[i - \alpha p . . m]$ and s[j . . m] is of length $|s[i - \alpha p . . m]|$.
Overall, the whole procedure takes constant time. □
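
The reformulation used at the start of this proof (an occurrence exists iff some border x of the prefix snippet and some border y of the suffix snippet satisfy x + y = m) can be tested with a naive reference implementation. The sketch below is not the constant-time procedure of Lemma 6; it enumerates all borders by brute force, and its function names are illustrative assumptions. Borders of length 0 and of full length are included.

```python
def borders_naive(w):
    """All border lengths of w (0 and len(w) included), by direct comparison."""
    n = len(w)
    return [b for b in range(n + 1) if w[:b] == w[n - b:]]


def occurs_in_concatenation(s, i, j):
    """Does s occur in s[1..i] s[j..m]? Decided via the condition x + y = m."""
    m = len(s)
    left, right = s[:i], s[j - 1:]
    ys = set(borders_naive(right))  # prefixes of s[j..m] that are suffixes of s
    # x ranges over suffixes of s[1..i] that are prefixes of s
    return any(m - x in ys for x in borders_naive(left))


def occurs_reference(s, i, j):
    """Ground truth obtained by building the concatenation explicitly."""
    return s in s[:i] + s[j - 1:]
```

Such a slow reference is mainly useful for checking a faster implementation of the lemma on small inputs.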
Fig. 1. Detecting an occurrence in a concatenation of two snippets.



The second result can be deduced from Lemma 6 and Lemma 8 of [7], but
we prefer to give an explicit proof for the sake of completeness. Its running time
is constant as long as $|s_1|$ is bounded from above by a constant fraction of $|s_2|$.

Lemma 7. Given a prefix snippet $s_1$ and a snippet $s_2$ for which we know the
corresponding (explicit or implicit) node in the suffix tree, we can compute the
longest prefix of s which is a suffix of $s_1 s_2$ in time $O(\max(1, \log \frac{|s_1|}{|s_2|}))$.
Proof. We try to find the longest border of $s_1$ = s[1 . . i] which can be extended
with $s_2$. If there is none, we use Lemma 3 on $s_2$ to extract the answer. Of course
$s_1$ might happen to have quite a lot of borders, and we do not have enough
time to go through each of them separately. We try to abuse Lemma 4 instead:
there are just $\log |s_1|$ groups of borders, and we are going to process each of
them in constant time. It is not enough though, we need something faster when
$|s_2|$ is relatively big compared to $|s_1|$. The whole method works as follows: as
long as $|s_2|$ is smaller than $2|s_1|$, we check if it is possible to extend any of the
long borders of $s_1$. If it is not possible, we replace $s_1$ with the longest prefix of
s which ends $s[\frac{|s_1|}{2} . . |s_1|]$ (we can preprocess such information for all prefixes
of s in linear time). When $|s_2|$ exceeds $2|s_1|$, we look for an occurrence of $s_2$
in a prefix of s of length $|s_1| + |s_2|$. All such occurrences create one arithmetic
progression due to Lemma 4 and it is possible to detect which one is preceded
by a suffix of $s_1$ in constant time. More specifically, we show how to implement in
constant time the following two primitives. In both cases the method resembles
the one from Lemma 6.

1. Computing the longest long border of $s_1$ which can be extended with $s_2$ to
form a prefix of s, if any. First we compute the period p of $s_1$ in constant
time due to Lemma 5. Then $p \le \frac{|s_1|}{2}$ and any long border begins after the
$\alpha p$-th letter, for some $\alpha \ge 0$. We compute how far the period extends in both
s and $s_2$; this gives us a simple arithmetic condition on the smallest value of
$\alpha$. More explicitly, there is either at most one valid $\alpha$, or all are correct.

2. Detecting the rightmost occurrence of $s_2$ in s preceded by a suffix of $s_1$,
assuming $|s_2| \ge 2|s_1|$. We begin with finding the first and the second occurrence
of $s_2$ in s. Assuming we have the corresponding vertex in the suffix tree
available, this takes just constant time. We check those (at most) two occurrences
naively. There might be many more of them, though. But if the two
first occurrences begin before the $|s_1|$-th character, we know that all other
interesting occurrences form one arithmetic progression with the known period
of $s_2$. We check how far the period extends in $s_1$ (starting from the right
end) and s (starting from the first occurrence of $s_2$); this again gives us a
simple arithmetic condition on the best possible shift.

□ 

5 Constructing balanced grammar 

Recall that an LZ parse is a sequence of triples $(start_i, len_i, next_i)$ for i =
1, 2, . . . , n. In the not self-referential variant considered in [4], we require that
$start_i + len_i - 1 \le \sum_{j<i} len_j$ so that each triple refers only to the prefix generated
so far. Although such an assumption is made by some LZ-based compressors,
[6] deals with the compressed pattern matching problem in its full generality,
allowing self-references. Thus for the sake of completeness we need to construct
a balanced grammar from a potentially self-referential LZ parse. It turns out
that a small modification of a known method is enough for this task.
Lemma 8 (see Theorem 1 of [4]). Given a (potentially self-referential) LZ
parse of size n, we can build an $\alpha$-balanced SLP of size $O(n \log \frac{N}{n})$ describing the
same string of length N, for any constant $0 < \alpha \le 1 - \frac{\sqrt{2}}{2}$. Running time of the
construction is proportional to the size of the output.

Proof. At a very high level, the idea of [4] is to process the parse from left
to right. When processing a triple $(start_i, len_i, next_i)$, we already have an $\alpha$-balanced
SLP describing the prefix of the whole text corresponding to the previously
encountered triples. Because the grammar is balanced, we can define
$t[start_i . . start_i + len_i - 1]$ by introducing a relatively small number of new
nonterminals (with small actually meaning small in the amortized sense). Now if
we allow the parse to be self-referential, it might happen that $t[start_i . . start_i + len_i - 1]$
sticks out from the right end of $t[1 . . \sum_{j<i} len_j]$. In such a case we do
as follows: let $L = \sum_{j<i} len_j$, and split the fragment corresponding to the current
triple into three parts. First we have $t[start_i . . L]$, then some repetitions
of the same fragment, and then $t[start_i . . len_i \bmod (L - start_i + 1)]$ followed
by a single letter $next_i$. After defining a nonterminal deriving $t[start_i . . L]$, we
can define a nonterminal deriving the repetitions at the expense of introducing
at most $2 \log len_i$ new nonterminals. Then we define a nonterminal deriving
$t[start_i . . len_i \bmod (L - start_i + 1)]\, next_i$. The only change in the analysis of this
method is that we might end up adding $\sum_{i=1}^{n} \log len_i$ new nonterminals, which
by the concavity of log is at most $O(n \log \frac{N}{n})$, and thus does not change the
asymptotic upper bound. Note that the authors of [4] were not concerned with
the computational complexity of their algorithm. Nevertheless, it is easy to see
that the only place which cannot be amortized by the number of new nonterminals
is finding the corresponding place at the so-called active symbols list and
traversing the grammar top-down in order to find the appropriate nonterminal.
The former can be implemented by storing the active list in a balanced search
tree, adding $O(n \log n)$ to the time. The latter adds just $O(n \log N)$ to the whole
running time. Hence we can implement the whole method in $O(n \log N)$. In order
to decrease this complexity to just $O(n \log \frac{N}{n})$, we cut the string into n parts
of roughly the same size. Note that this requires that our computational model
allows constant time integer division.

Note that the algorithm in [4] contains one special case: if the compression
ratio is at most 2e, the trivial grammar is returned. We do the same. □

As a result we get a context-free grammar in which all nonterminals derive
exactly one string, and right sides of all productions are of the form XY with
$\frac{\alpha}{1-\alpha} \le \frac{|X|}{|Y|} \le \frac{1-\alpha}{\alpha}$. The exact value of $\alpha$ is not important, we only need the fact
that both $\frac{|X|}{|Y|}$ and $\frac{|Y|}{|X|}$ are bounded from above. For the sake of concreteness
we assume $\alpha = 0.25$. We also need to compute |X| for each nonterminal X,
and to group the nonterminals according to the (rounded down) logarithm of
their length, with the base of the logarithm to be chosen later. Note that taking
logarithms of large numbers (i.e., substantially longer than log n bits) is not
necessarily a constant time operation in our model. We can use the fact that the
grammar is balanced here: if $X \to YZ$, then $\log_b |X| \le \beta + \max(\log_b |Y|, \log_b |Z|)$
for some constant $\beta$ depending only on $\alpha$ and b, and the logarithms can be
computed for all nonterminals in a bottom-up fashion using just linear time.
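
The bottom-up computation of rounded-down logarithms can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the SLP is encoded as a list whose entries are single characters or pairs of earlier indices, the base is fixed to 4/3 as an example value, and Python big-integer powers stand in for the constant-time arithmetic of the RAM model. For each production the search starts from the larger of the two children's values, so in a balanced grammar only a constant number of candidates is tried.

```python
def floor_log(length, start, num=4, den=3):
    """Largest g >= start with (num/den)**g <= length, using integers only.

    Assumes (num/den)**start <= length already holds.
    """
    g = start
    while num ** (g + 1) <= length * den ** (g + 1):
        g += 1
    return g


def slp_log_groups(rules):
    """Return (lengths, groups), groups[i] being the floored log of lengths[i]."""
    lengths, groups = [], []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
            groups.append(0)
        else:
            j, k = rule
            lengths.append(lengths[j] + lengths[k])
            # a child is never longer than its parent, so its value is a valid start
            groups.append(floor_log(lengths[-1], max(groups[j], groups[k])))
    return lengths, groups


# Hypothetical sample grammar deriving "abaab"; it is 0.25-balanced.
rules = ["a", "b", (0, 1), (2, 0), (3, 2)]
lengths, groups = slp_log_groups(rules)
```

In this sample every child has a strictly smaller group index than its parent, and the difference is at most 5.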

6 Processing balanced grammar 

While the final goal of this section is an $O(n \log \frac{N}{n} + m)$ time algorithm, we start
with a simple $O(n \log \frac{N}{n} \log m + m)$ time solution, which then is modified to take
just $O(n \log \frac{N}{n} + m \log m)$, and finally $O(n \log \frac{N}{n} + m)$ time.

For each nonterminal X we would like to check if the string it represents 
occurs inside s. If it does not, we would like to compute prefix(X) and suffix(X),
the longest prefix (suffix) which is a suffix (prefix) of the whole s. Given such 
information for all possible nonterminals, we can easily detect an occurrence: 

Lemma 9. If s occurs in a string represented by an SLP then there exists a
production $X \to YZ$ such that s occurs in suffix(Y) prefix(Z).

Proof. Consider the leftmost occurrence of s. Take the starting symbol X = S
and its production $X \to YZ$. If the leftmost occurrence is completely inside Y
or Z, repeat with X replaced with Y or Z. Otherwise the occurrence crosses
the boundary between Y and Z, in other words there is a prefix snippet s[1 . . i]
ending Y and a suffix snippet s[i + 1 . . m] starting Z. Then $|suffix(Y)| \ge i$ and
$|prefix(Z)| \ge m - i$, and s occurs in suffix(Y) prefix(Z). □
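
Lemma 9 suggests a simple detection loop over the productions. The sketch below is a deliberately naive illustration and not the paper's algorithm: it expands every nonterminal explicitly (which is exponential in general), computes prefix and suffix by brute force, and assumes an SLP encoded as a list of single characters and pairs of earlier indices in which every nonterminal is reachable from the last one. All function names are hypothetical.

```python
def expand_all(rules):
    """String derived by each nonterminal (only feasible for tiny grammars)."""
    texts = []
    for rule in rules:
        texts.append(rule if isinstance(rule, str) else texts[rule[0]] + texts[rule[1]])
    return texts


def prefix_of(x, s):
    """prefix(X): the longest prefix of x which is a suffix of s."""
    for k in range(min(len(x), len(s)), -1, -1):
        if s.endswith(x[:k]):
            return x[:k]


def suffix_of(x, s):
    """suffix(X): the longest suffix of x which is a prefix of s."""
    for k in range(min(len(x), len(s)), -1, -1):
        if s.startswith(x[len(x) - k:]):
            return x[len(x) - k:]


def slp_match(rules, s):
    """True if some production X -> YZ has s inside suffix(Y) prefix(Z)."""
    texts = expand_all(rules)
    for rule in rules:
        if isinstance(rule, str):
            continue  # terminal rules have no boundary to inspect
        y, z = texts[rule[0]], texts[rule[1]]
        if s in suffix_of(y, s) + prefix_of(z, s):
            return True
    return False


# Hypothetical sample grammar deriving the text "abaab".
rules = ["a", "b", (0, 1), (2, 0), (3, 2)]
```

The paper replaces the explicit strings by snippets of the pattern, so that each production is handled without ever expanding the text.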

Theorem 1. Given a (potentially self-referential) Lempel-Ziv parse of size n
describing a text t[1 . . N] and a pattern s[1 . . m], we can detect an occurrence of
s inside t deterministically in time $O(n \log \frac{N}{n} \log m + m)$.

Proof. By Lemma 8 and Lemma 9 we only have to compute for each nonterminal
X its corresponding snippet (if any) and both prefix(X) and suffix(X).
We process the productions in a bottom-up order. Assume that we have the
information concerning Y and Z available and would like to process $X \to YZ$.
If both Y and Z correspond to substrings of s, we can apply binary search in the
suffix array to check if their concatenation does as well in $O(\log m)$ steps, each
step consisting of two applications of Lemma 2 used to compare the concatenation
with a suffix of s. To compute prefix(X) and suffix(X) in $O(\log m)$ time we
could use Lemma 7. There is one difficulty here, though: we need to know the
corresponding node in the suffix tree. To this end we show how to preprocess
the tree in linear time so that the corresponding (implicit or explicit) node can
be found in $O(\log m)$ time.

If we allow as much as $O(m \log m)$ preprocessing time, the implementation
is very simple: for each vertex of the suffix tree we construct a balanced search
tree containing all its ancestors sorted according to their depths. Constructing
the tree for a vertex requires inserting just one new element into its parent tree
(note that most standard balanced binary search trees can be made persistent
so that inserting a new number creates a new copy and does not destroy the old
one) and so the whole construction takes $O(m \log m)$ time. This is too much by
a factor of log m, though. We use the standard micro-macro tree decomposition
to remove it. The suffix tree is partitioned into small subtrees by choosing at
most $\frac{m}{\log m}$ macro nodes such that after removing them we get a collection of
connected components of at most logarithmic size. Such a partition can be easily
found in linear time. Then for each macro node we construct a binary search tree
containing all its macro ancestors sorted according to their depths. There are
just $\frac{m}{\log m}$ macro nodes so the whole preprocessing is linear. To find the ancestor
of v at depth d we first retrieve the lowest macro ancestor u of v by following at
most log m edges up from v. If none of the traversed vertices is the answer, we
find the macro ancestor of u of largest depth not smaller than d using the binary
search tree in $O(\log m)$ time. Then retrieving the answer requires following at
most log m edges up from u. □

We would like to remove the log m factor from the above complexity. It
seems that the main difficulty here is that we need to implement a procedure 
for detecting if a concatenation of two substrings of s occurs in s as well, and in 
order to get the claimed running time we would need to answer such queries in 
constant time after a linear (or close to linear) preprocessing. We overcome this 
obstacle by choosing to work with an approximation of this information instead 
and using the fact that the grammar we are working with is balanced. 

Definition 2. A cover of a nonterminal X is a pair of snippets $s[i . . i + 2^k - 1]$
and $s[j . . j + 2^k - 1]$ such that $2^k \le |X| < 2^{k+1}$, $s[i . . i + 2^k - 1]$ is a prefix of the
string represented by X, and $s[j . . j + 2^k - 1]$ is a suffix of the string represented
by X. We call k the order of X's cover.
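
A cover can be computed naively when the string represented by X is available explicitly, which helps to see why it is only an approximation of "X occurs in s". The sketch below is an illustration only: the function name `naive_cover` is hypothetical, the represented string is assumed to be nonempty and given as a plain Python string, and `str.find` replaces the paper's batched machinery. Snippets are returned as 1-based inclusive pairs (i, j).

```python
def naive_cover(x, s):
    """Return (k, prefix_snippet, suffix_snippet) or None if no cover exists.

    x is the (nonempty) string represented by a nonterminal, s is the pattern.
    """
    k = len(x).bit_length() - 1  # 2**k <= len(x) < 2**(k + 1)
    size = 1 << k
    i = s.find(x[:size])            # some occurrence of the prefix of length 2**k
    j = s.find(x[len(x) - size:])   # some occurrence of the suffix of length 2**k
    if i < 0 or j < 0:
        return None
    return k, (i + 1, i + size), (j + 1, j + size)


s = "abracadabra"  # hypothetical sample pattern
inside = naive_cover("cadab", s)      # this string occurs in s
outside = naive_cover("abrabra", s)   # has a cover although it does not occur in s
missing = naive_cover("zzz", s)       # no cover, hence no occurrence
```

The second call shows the one-sided nature of the definition: a missing cover proves that the string does not occur in s, while an existing cover proves nothing by itself.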

We try to find the cover of each nonterminal X. If there is none, we know 
that the string it represents does not occur inside s. In such case we compute 
prefix(X) and suffix(X). More precisely, we either: 

1. compute the cover, in such case the string represented by X might or might
not occur in s,

2. do not compute the cover, in such case the string represented by X does not 
occur in s. 

As we will see later, it is possible to extract prefix(X) and suffix(X) from the
cover of X using Lemma 7 in constant time, and the information about prefix(X)
and suffix(X) for each nonterminal X is enough to detect an occurrence.

To find the covers we process the nonterminals in groups. Nonterminals in the
$\ell$-th group $G_\ell = \{X_1, X_2, \ldots, X_s\}$ are chosen so that $(\frac{4}{3})^\ell \le |X_i| < (\frac{4}{3})^{\ell+1}$. The
groups are disjoint so $\sum_\ell |G_\ell| = O(n \log \frac{N}{n})$. Furthermore, the partition can be
constructed in linear time. We start with computing the covers of nonterminals
in $G_1$ naively. Then we assume that all nonterminals in $G_{\ell-1}$ are already processed,
and we consider $G_\ell$. Because the grammar is 0.25-balanced, if $X_i \to Y_i Z_i$ then
$|Y_i|, |Z_i| \le \frac{3}{4}|X_i|$, and $Y_i$, $Z_i$ belong to already processed $G_{\ell'}$ with $\ell - 5 \le \ell' < \ell$.
If for some $Y_i$ or $Z_i$ we do not have the corresponding cover, the corresponding
$X_i$ cannot have one either, so we use Lemma 7 to calculate prefix($X_i$), suffix($X_i$), and
remove $X_i$ from $G_\ell$. For all remaining $X_i$ we are left with the following task: given
the covers of $Y_i$ and $Z_i$, compute the cover of $X_i$, or detect that the represented
string does not occur in s and so we do not need to compute the cover. Note that
the known covers are of order k with $k_{min} = \lfloor \ell \log \frac{4}{3} \rfloor - 3 \le k \le \lceil \ell \log \frac{4}{3} \rceil = k_{max}$.

We reduce computing covers to a sequence of batched queries of the form:
given a sequence of pairs of snippets $s[i . . i + 2^{k_1} - 1]$, $s[j . . j + 2^{k_2} - 1]$, does
their concatenation occur in s, and if so, what is the corresponding snippet? We
call this merging the pair. For each $\ell$ we will require solving a constant number
of such problems with $k_{min} \le k_1, k_2 \le k_{max}$, each containing $O(|G_\ell|)$ queries.
We call this problem Batched-powers-merge. Before we develop an efficient
solution for such a question, let us see how it can be used to compute covers.

Lemma 10. Computing covers of the nonterminals in any $G_\ell$ can be reduced in
linear time to a constant number of calls to Batched-powers-merge, with the
number of pairs in each call bounded by $|G_\ell|$.

Proof. Recall that for each given pair of snippets we have their covers available,
and the orders of those covers are from {k, k + 1, ..., k + 4}. Consider the
situation for a single pair, see Figure 2. Let a, b be the cover of the first
snippet and c, d the cover of the second snippet. First we merge b and c to get
merge(b, c). Then we extend a to the right and d to the left by merging with the
corresponding fragments of merge(b, c) of length $2^k$, and call the results
extend(a) and extend(d). Then we would like to iteratively extend both a and d
with fragments of this length as long as it does not result in sticking out of
the considered word w. To do that, we need to have the snippets corresponding to
those fragments available. Consider the situation for a: first we extract the
snippets from merge(b, c), then from extend(d). We claim that we are always able
to perform such an extraction: if the next $2^k$ characters fall outside
merge(b, c), the distance to the left boundary of d does not exceed $2^k$ and thus
we can use extend(d). If during this extending procedure the merging fails, the
pair does not represent a substring of s. Otherwise we get the snippets
corresponding to the prefix and suffix of w of length $|w| - (|w| \bmod 2^k)$,
which allows us to extract the prefix and suffix of length $2^{k'}$ where
$2^{k'} \le |w| < 2^{k'+1}$, because $k \le k'$.

To finish the proof, note that for a single pair we need a constant number 
of merges. Thus we can do the merging in parallel for all pairs in a constant 
number of calls to Batched-powers-merge. □ 

Now we only have to develop the algorithm for Batched-powers-merge.
A simple solution would be to do a binary search in the suffix array built for
s for each pair separately: we can compare $s[i..i+2^{k_1}-1]\,s[j..j+2^{k_2}-1]$
with any suffix of s in constant time using at most two longest common prefix
queries, so a single search takes $O(\log m)$ time, which gets us back to the bounds
from Theorem 1. In order to get a better running time we aim to exploit the
fact that we are given many pairs at once. First observe that we can order all
concatenations from a single problem efficiently.

[Figure 2: the snippets $s[i..i+2^{k_1}-1]$ and $s[j..j+2^{k_2}-1]$ with their covers
a, b and c, d, together with merge(b, c), extend(a) and extend(d); each marked
block has length $2^k$.]

Fig. 2. Computing cover of a pair of snippets.
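
The simple per-pair solution mentioned above (one binary search in the suffix array per pair) can be sketched as follows. This is only an illustration under simplifying assumptions: the suffix array is built naively, comparisons are done by slicing strings rather than by longest common prefix queries, positions are 0-based, and the function names are hypothetical.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; fine for a small illustration only.
    return sorted(range(len(s)), key=lambda p: s[p:])

def first_occurrence(s, sa, w):
    """Smallest index r such that the suffix s[sa[r]:] starts with w, or None."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Comparing the length-|w| prefix of the suffix with w stands in
        # for the constant-time comparison via LCP queries.
        if s[sa[mid]:sa[mid] + len(w)] < w:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and s.startswith(w, sa[lo]):
        return lo
    return None

def merge_pair(s, sa, i, k1, j, k2):
    """Start of an occurrence of s[i:i+2**k1] + s[j:j+2**k2] in s, or None."""
    w = s[i:i + 2 ** k1] + s[j:j + 2 ** k2]
    r = first_occurrence(s, sa, w)
    return None if r is None else sa[r]
```

With `s = "abracadabra"`, merging the snippets starting at 0 and 2 (both of length 2) yields the concatenation "abra", which the sketch locates in s, while "ab" followed by "ca" is reported as absent.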



Lemma 11. Given $O(|G_\ell|)$ pairs of words of the form $s[i..i+2^{k_1}-1]$,
$s[j..j+2^{k_2}-1]$ with $k_{\min} \le k_1, k_2 \le k_{\max}$ we can lexicographically sort
their concatenations in time $O(|G_\ell| + m^\varepsilon)$ if $|k_{\max} - k_{\min}| \in O(1)$.

Proof. We split the words to be sorted into a constant number of chunks of
length $2^{k_{\min}}$. Then we would like to assign numbers to those chunks so that
$nr(s[i..i+2^{k_{\min}}-1]) < nr(s[j..j+2^{k_{\min}}-1])$ iff
$s[i..i+2^{k_{\min}}-1] <_{lex} s[j..j+2^{k_{\min}}-1]$. To compute all
$nr(s[i..i+2^{k_{\min}}-1])$ we retrieve the positions of $s[i..m]$ in the suffix
array. Then we sort the resulting list of $O(|G_\ell|)$ integers using radix sort,
i.e., by $1/\varepsilon$ rounds of counting sort. The time required by this
sorting is linear plus $O(m^\varepsilon)$. After sorting we scan the list and identify
different suffixes with the same prefix of length $2^{k_{\min}}$; such suffixes belong
to contiguous blocks whose boundaries can be identified using longest common
prefix queries. Then the original task reduces to sorting a list of constant
length vectors consisting of integers not exceeding m, which can be done
efficiently using radix sort. □
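
The reduction in the proof, from sorting concatenations to sorting short vectors of chunk numbers, can be sketched as follows. The sketch is a simplification and not the paper's procedure: chunk numbers are obtained by sorting the distinct chunks directly instead of using the suffix array, Python's built-in sort replaces radix sort, positions are 0-based, and `sort_concatenations` is a hypothetical name.

```python
def sort_concatenations(s, pairs, k_min):
    """Return the indices of `pairs` ordered by the concatenated words.

    Each pair is (i, k1, j, k2) and denotes s[i:i+2**k1] + s[j:j+2**k2],
    with k_min <= k1, k2 and both fragments lying inside s.
    """
    chunk = 2 ** k_min
    starts = range(len(s) - chunk + 1)
    # Number the chunks of length 2**k_min so that the numbers respect
    # lexicographic order and equal chunks receive equal numbers.
    distinct = sorted({s[p:p + chunk] for p in starts})
    rank_of = {c: r for r, c in enumerate(distinct)}
    nr = [rank_of[s[p:p + chunk]] for p in starts]

    def vector(i, k1, j, k2):
        first = [nr[p] for p in range(i, i + 2 ** k1, chunk)]
        second = [nr[p] for p in range(j, j + 2 ** k2, chunk)]
        return first + second

    keys = [vector(*p) for p in pairs]
    # Comparing vectors of chunk numbers equals comparing the words themselves.
    return sorted(range(len(pairs)), key=lambda t: keys[t])
```

Because all chunks have the same length, comparing two vectors entry by entry gives the same result as comparing the underlying words, which is the observation the proof relies on.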

We apply the above lemma to all calls to Batched-powers-merge corresponding
to nonempty $G_\ell$. If $(4/3)^\ell > m$ then clearly the corresponding $G_\ell$ is
empty, so the total running time of this part is just
$O(m^\varepsilon \log m + \sum_\ell |G_\ell|) = O(m + n \log \frac{N}{n})$. Now that the
queries in a single call to Batched-powers-merge are sorted, instead of
performing a separate binary search for each of them we can scan the queries and
the suffix array at once, resulting in an $O(|G_\ell| + m)$ running time for each
different $\ell$. This gives us the following total running time.

Theorem 2. Given a (potentially self-referential) Lempel-Ziv parse of size n
describing a text t[1..N] and a pattern s[1..m], we can detect an occurrence of
s inside t deterministically in time $O(n \log \frac{N}{n} + m \log m)$.

This is still not enough to improve [6] on all possible inputs. We would like
to replace $m \log m$ by $m$ in the above complexity by focusing on improving
the running time of Batched-powers-merge. At a high level the idea is to
consider the queries in a single call in sorted order, and for each of them perform
a binary search starting from the place where the lexicographically previous pair
was found. This might still be too slow, though. To accelerate the search we
develop a constant time procedure for locating the fragment of the suffix array
corresponding to all occurrences of any $s[i..i+2^k-1]$.

Lemma 12. The pattern s can be processed in linear time so that given any
$s[i..i+2^k-1]$ we can compute its first and last occurrence in the suffix
array of s in constant time.

Proof. It is enough to show that the suffix tree T built for s can be preprocessed
in linear time so that we can locate the (implicit or explicit) vertex corresponding
to any fragment whose length is a power of 2 in constant time. For that we should
locate an ancestor of a given leaf which is at a specified depth $2^k$. This can be
reduced to the so-called weighted ancestor queries: given a node-weighted tree,
with the weights nondecreasing on any root-to-leaf path, preprocess it to find the
predecessor of a given weight among the ancestors of v efficiently. Unfortunately,
all known solutions for this problem [5,12] give nonconstant query time. We wish
to improve this time by exploiting the fact that only ancestors at depths $2^k$ are
sought. First note that such an ancestor is not necessarily an explicit vertex. We
start with considering all edges of T. For each such edge e, we compute the
smallest k such that e contains an implicit vertex at depth $2^k$ (there might be
none), and split the edge to make it explicit. We call all original vertices at depths
being powers of 2, and all new vertices, marked. For each vertex v we would like
to compute the depths of all its marked ancestors, see Figure 3. This can be
done in linear time by a single top-down traversal, and the information
can be stored in a single $\Theta(\log |s|)$-bit word. More precisely, for each vertex v
we construct a single word marked(v) with the k-th bit set iff v has a marked
ancestor at depth $2^k$. Then we construct T' = compress(T) containing only the
leaves and marked vertices of T by collapsing all maximal fragments of T without
such vertices, and build the level ancestor data structure for T' [3] allowing us to
find the k-th ancestor of any vertex in constant time. Now given i and k we first
locate the leaf v corresponding to $s[i..|s|]$ in T, then take a look at its bitvector
marked(v). We can compute in constant time
$\ell = |\{k' > k : k' \in \mathrm{marked}(v)\}|$ and
retrieve the $\ell$-th ancestor of v in T'. Going back to T we get a node with the
same (lexicographically) smallest and largest suffix in its subtree as the node
corresponding to $s[i..i+2^k-1]$.
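
Counting the marked ancestors recorded in marked(v) at bit positions above k is a population count on a shifted word. A minimal sketch, assuming marked(v) is stored as a plain Python integer; the function name is hypothetical, and how the count is turned into an ancestor index depends on the level ancestor convention, which is left open here.

```python
def marked_above(marked: int, k: int) -> int:
    """Number of set bits of `marked` at positions strictly greater than k."""
    # Shifting by k + 1 discards bits 0..k; the remaining set bits are counted.
    return bin(marked >> (k + 1)).count("1")
```

On a word RAM the same quantity is obtained with a shift followed by a popcount instruction or a small lookup table, which is what makes the step constant time.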

While the structure of [3] does give constant time answers, we can use a
significantly simpler solution building on the fact that the depth of T' is just
$\log m$. First we use the standard micro-macro tree decomposition, which gives us
a top fragment containing just $\frac{m}{\log m}$ leaves, and a collection of small trees on at
most $\log m$ leaves. Note that in this particular case, the total number of vertices
cannot be much larger than the number of leaves: the original tree contained no
vertices with outdegree 1, then we introduced at most one such vertex at each
edge, and then we collapsed some parts of the tree. For each node in the top
tree we store all $\log m$ answers explicitly. For each small tree we do as follows:
first number its nodes in a depth-first order, then for each node compute a single
bitvector containing the numbers of all its ancestors. To find the k-th ancestor
of a given vertex v, we consider two cases.

1. v belongs to the top tree. Then we have the answer available.

2. v belongs to some small tree. We first check in constant time if its depth in
this small tree does not exceed k. If it does, we can use the precomputed
answers stored for the parent (in the top tree) of the root. Otherwise we take
a look at the bitvector corresponding to v, and find its k-th highest bit set
to 1. Then we retrieve the node corresponding to this depth-first number.

□ 

Observe that the above lemma can be used to give an optimal solution for
a slight relaxation of the substring fingerprints problem considered in [5]. This
problem is defined as follows: given a string s, preprocess it to compute any
substring hash $h_s(s[i..j])$ efficiently. We require that:

1. $h_s(s[i..j]) \in [1, O(|s|^2)]$ so that the values can be operated on efficiently,

2. $h_s(s[i..j]) = h_s(s[k..l])$ iff $s[i..j] = s[k..l]$.

If we allow the range of $h_s$ to be slightly larger, say $O(|s|^3)$, a direct application
of the above lemma allows us to evaluate the fingerprints in constant time.

Theorem 3. Substring fingerprints of size $O(|s|^3)$ can be computed in constant
time after a linear time preprocessing.

Proof. First we apply the preprocessing from Lemma 12 to s. We also store
$\lfloor \log x \rfloor$ for any $1 \le x \le |s|$. Then given a query $s[i..j]$ we compute
$k = \lfloor \log(j - i + 1) \rfloor$ and using constant time level ancestor queries we locate the
lowest existing ancestors of both $s[i..i+2^k-1]$ and $s[j-2^k+1..j]$ in the suffix
tree. Then $h_s(s[i..j])$ is a triple containing $j - i + 1$ and those two ancestors. □
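
The shape of the fingerprint (length, identifier of the power-of-two prefix, identifier of the power-of-two suffix) can be illustrated with a naive sketch. Identifiers of fragments of length $2^k$ are assigned here with dictionaries rather than suffix tree nodes, so the preprocessing below is not linear time; it is an assumption-laden stand-in for Lemma 12 with hypothetical function names and 0-based positions.

```python
def preprocess(s):
    """ids[k][p] identifies s[p:p+2**k]; equal fragments get equal identifiers."""
    ids = []
    k = 0
    while 2 ** k <= len(s):
        table = {}
        row = [table.setdefault(s[p:p + 2 ** k], len(table))
               for p in range(len(s) - 2 ** k + 1)]
        ids.append(row)
        k += 1
    return ids

def fingerprint(ids, i, j):
    """Fingerprint of s[i..j] (0-based, inclusive) as a triple."""
    length = j - i + 1
    k = length.bit_length() - 1  # floor(log2(length))
    # The two fragments of length 2**k cover the whole substring.
    return (length, ids[k][i], ids[k][j - 2 ** k + 1])
```

Two substrings of the same length are equal exactly when their prefixes and suffixes of length $2^k$ agree, since these two fragments together cover the substring; this is why the triple determines the substring.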



Now getting back to the original question, the input to Batched-powers-merge
is a sequence of pairs of snippets $w_1, w_2, \ldots, w_{|G_\ell|}$. By Lemma 11 we can
consider them in a sorted order. For each such pair
$w = s[i..i+2^{k_1}-1]\,s[j..j+2^{k_2}-1]$, we first look up the fragment of the suffix
array corresponding to its prefix $s[i..i+2^{k_{\min}}-1]$ using Lemma 12. Then we apply
binary search in this fragment, with the exception that if the previous binary
search was in this fragment as well, we start from the position where it
finished, not from the beginning of the fragment. Additionally, the binary search
is performed from the beginning and the end of the interval at the same time, see
Two-way-binary-search. If the initial interval is [a, b] and the position we are
after is r, such a modified search uses just $O(\log \min(r - a + 1, b - r + 1))$
applications of Lemma 2 instead of $O(\log(b - a + 1))$, which is important.



Algorithm 1 Two-way-binary-search(a, b, w)

  x ← a, y ← b
  k ← 1
  while $2^k < b - a$ do
    if $w <_{lex} s[SA[a + 2^k]..|s|]$ then
      y ← $a + 2^k$
      break
    end if
    if $s[SA[b - 2^k]..|s|] <_{lex} w$ then
      x ← $b - 2^k$
      break
    end if
    k ← k + 1
  end while
  r ← binary search for w in $s[SA[x]..|s|], s[SA[x+1]..|s|], \ldots, s[SA[y]..|s|]$
  return r
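
The two-way search can be tried out on a plain sorted list. The sketch below follows the structure of Algorithm 1 under a few assumptions that are not taken from the paper: the suffix array is replaced by a sorted Python list `keys`, the search returns the leftmost position holding a value not smaller than w, the target is assumed to lie within [a, b], and the comparisons are chosen to fit this lower-bound convention.

```python
from bisect import bisect_left

def two_way_binary_search(keys, a, b, w):
    """Smallest r in [a, b] with keys[r] >= w.

    Assumes keys[a..b] is sorted and keys[b] >= w.
    """
    x, y = a, b
    k = 1
    while 2 ** k < b - a:
        if w <= keys[a + 2 ** k]:
            # The answer lies within distance 2**k of the left end.
            y = a + 2 ** k
            break
        if keys[b - 2 ** k] < w:
            # The answer lies within distance 2**k of the right end.
            x = b - 2 ** k
            break
        k += 1
    # Ordinary binary search restricted to the narrowed range [x, y].
    return bisect_left(keys, w, x, y + 1)
```

The doubling probes from both ends stop after about log of the distance from the answer to the nearer end, and the final binary search runs on a range of comparable size, which is the property used in the amortized analysis.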







While a single binary search might require non-constant time, we will show
that their amortized complexity is constant. To analyze the whole sequence of
those searches, we keep a partition of the whole $[1, |s|]$ into a number of disjoint
intervals. Doing a single search splits at most one interval into two parts at the
position of the first occurrence. If the first occurrence is exactly at an already
existing boundary, there is no split; otherwise we say that those two smaller
intervals have been created in phase $k_{\min}$ (recall that $k_{\min}$ depends linearly
on $\ell$), and intervals created in phase $k_{\min}$ are kept in a list $I_{k_{\min}}$. We do not
want to split an interval more than once and hence each call to
Batched-powers-merge starts with finding for each $w_i$ its corresponding interval
in $I_{k_{\min}}$. After processing all concatenations, we add the new intervals to
$I_{k_{\min}}$ and prune it to contain only the intervals which are minimal under
inclusion. Scanning and pruning $I_{k_{\min}}$ takes time linear in its size, and we show
that this size is small.

Lemma 13. All $O(\log m)$ calls to Batched-powers-merge run in total time
$O(m + \sum_\ell |G_\ell|)$.

Algorithm 2 Batched-powers-merge($w_1, w_2, \ldots, w_{|G_\ell|}$)
1: sort all $w_i$    ▷ Lemma 11
2: scan $I_{k_{\min}}$ to find the intervals containing $w_i$
3: L ← ∅
4: $r_0$ ← 1
5: for i ← 1 to $|G_\ell|$ do
6:   [a, b] ← the interval corresponding to $w_i[1..2^{k_{\min}}]$ in SA    ▷ Lemma 12
7:   choose [c, d] ∈ $I_{k_{\min}}$ containing the first occurrence of $w_i$ in SA
8:   if [c, d] is defined then
9:     a ← max(a, c)
10:    b ← min(b, d)
11:  end if
12:  a ← max($r_{i-1}$, a)
13:  $r_i$ ← Two-way-binary-search(a, b, $w_i$)
14:  add [a, $r_i$] and [$r_i$, b] to L
15: end for
16: sort L and merge it with $I_{k_{\min}}$, removing non-minimal intervals
17: return all answers $r_i$
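
A much simplified sketch of the batching idea: once the queries are sorted, each search can start from the position where the previous one ended. The interval lists $I_{k_{\min}}$, the constant-time lookup of Lemma 12 and the two-way search are all left out, the suffix array is built naively, and the function name is hypothetical, so this only illustrates the monotone scan and not the full algorithm.

```python
from bisect import bisect_left

def batched_first_occurrences(s, queries):
    """For each query word return the start of an occurrence in s, or None."""
    sa = sorted(range(len(s)), key=lambda p: s[p:])
    suffixes = [s[p:] for p in sa]  # materialised only to keep the sketch short
    order = sorted(range(len(queries)), key=lambda t: queries[t])
    answers = [None] * len(queries)
    prev = 0
    for t in order:
        w = queries[t]
        # Sorted queries have nondecreasing positions in the suffix array,
        # so the search never has to look to the left of the previous answer.
        r = bisect_left(suffixes, w, prev)
        if r < len(sa) and suffixes[r].startswith(w):
            answers[t] = sa[r]
        prev = r
    return answers
```

For `s = "abracadabra"` and the queries `["bra", "abra", "zzz", "cad", "a"]` the sketch reports the 0-based starting positions 8, 7, none, 4 and 10.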



Proof. First note that the sorting in line 16 can be performed in time
$O(m^\varepsilon + |I_{k_{\min}}| + |G_\ell|)$ using radix sort. Line 1 takes time
$O(m^\varepsilon + |G_\ell|)$ due to Lemma 11 and line 2 requires
$O(|I_{k_{\min}}| + |G_\ell|)$. All executions of line 7 take time $O(|I_{k_{\min}}|)$
because the words $w_i$ are already sorted. For the time being assume that the
binary search in line 13 is for free. Then the total complexity becomes
$O(\sum_i (m^\varepsilon + |G_\ell| + |I^{(i)}_{k_{\min}}|))$, where $|I^{(i)}_{k_{\min}}|$
is the size of $I_{k_{\min}}$ just before the i-th call to
Batched-powers-merge. There is a constant number of those calls for each
of the $O(\log m)$ values of $\ell$, and each $k_{\min}$ corresponds to at most a constant
number of different consecutive values of $\ell$, thus the sum is in fact
$O(m + \sum_\ell |G_\ell|)$.

To finish the proof we have to bound the time taken by all binary searches.
For that we will view the intervals as vertices of a tree. Whenever
performing a binary search splits an interval into two, we add a left and right child
to the corresponding leaf v, see Figure 4. The rank rank(v) of a vertex v is the
rounded logarithm of its weight, which is the length of the corresponding interval.
Then the cost of line 13 is simply $O(1 + \min(\mathrm{rank}(\mathrm{left}(v)), \mathrm{rank}(\mathrm{right}(v))))$ where
left(v) and right(v) are the left and right child of v, respectively. Hence we should
bound the sum $\sum_v \min(\mathrm{rank}(\mathrm{left}(v)), \mathrm{rank}(\mathrm{right}(v)))$, where v is a non-leaf. We
say that a vertex is charged when its weight does not exceed the weight of its
brother. Now we claim that there are at most $\frac{m}{2^k}$ charged vertices of rank k:
assume that there are u and v such that u is an ancestor of v, both are charged
and of rank k; then the weight of v plus the weight of its brother is at least
twice as large as the weight of v alone, thus the rank of their parent is larger
than the rank of v, a contradiction. So all charged vertices of the same rank
correspond to disjoint intervals, and there cannot be more than $\frac{m}{2^k}$ disjoint
intervals of length at least $2^k$ on a segment of length m. Bounding the sum gives
the claim:

$$\sum_{v} \min(\mathrm{rank}(\mathrm{left}(v)), \mathrm{rank}(\mathrm{right}(v))) \le \sum_{k \ge 0}^{\log m} k \frac{m}{2^k} \le m \sum_{k \ge 0} \frac{k}{2^k} = 2m \qquad \Box$$

Fig. 4. Interpreting the intervals as a tree.
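
The last step of the bound on the binary searches uses the closed form of the series $\sum_{k \ge 0} k/2^k$. A short derivation, added here for convenience and not part of the original proof, with $S$ introduced only as a name for the sum:

```latex
\begin{aligned}
S &= \sum_{k \ge 0} \frac{k}{2^k}, \qquad
\frac{S}{2} = \sum_{k \ge 0} \frac{k}{2^{k+1}} = \sum_{k \ge 1} \frac{k-1}{2^{k}}, \\
S - \frac{S}{2} &= \sum_{k \ge 1} \frac{k - (k-1)}{2^k} = \sum_{k \ge 1} \frac{1}{2^k} = 1
\quad\Longrightarrow\quad S = 2 .
\end{aligned}
```

Multiplying by m gives the value 2m appearing in the bound.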

Hence for all productions $X \to YZ$ such that we have the cover of both Y
and Z, we either computed the cover of X or decided that there is none. If for
a production we cannot find the cover of X, we compute prefix(X), suffix(X)
given the covers of Y and Z using a few applications of Lemma 7 with carefully
chosen arguments.

Lemma 14. Given the covers of Y and Z, we can compute prefix(X) and
suffix(X) in constant time as long as $\frac{|Y|}{|Z|}$ and $\frac{|Z|}{|Y|}$ are bounded from above by
a constant. To compute prefix(X) we can use prefix(Z) instead of the cover of
Z, and to compute suffix(X) we can use suffix(Y) instead of the cover of Y.

Proof. It is enough to consider prefix(X). The idea is to use a few applications
of Lemma 7 with carefully chosen arguments, see Figure 5. More specifically, let
a, b and c, d be the covers of Y and Z, respectively. First we locate the vertex
corresponding to d in the suffix tree; due to Lemma 12 and $|d| = 2^k$ it takes
constant time. Then:

(1) apply Lemma 3 to compute prefix(d) if we have the cover of Z, otherwise
take the known prefix(Z) and go to (3),

(2) apply Lemma 7 to c and prefix(d) without the first $|c| + |d| - |Z|$ letters to
get prefix$_1$,

(3) apply Lemma 7 to b and prefix$_1$ to get prefix$_2$,

(4) apply Lemma 7 to a and prefix$_2$ without the first $|a| + |b| - |Y|$ letters to get
the desired answer prefix$_3$.

Note that whenever we apply the lemma to two words u and v, |v| is a power
of 2 and so we can use Lemma 12 to locate its corresponding node in constant
time. Also, it holds that $|u| \ge \min(|Y|, |Z|)$ and $|v| \le |Y| + |Z|$ and so the
running time is bounded by:

$$\max\left(1, \log\frac{|v|}{|u|}\right) \le \max\left(1, \log\frac{|Y| + |Z|}{\min(|Y|, |Z|)}\right) = \log\left(1 + \frac{\max(|Y|, |Z|)}{\min(|Y|, |Z|)}\right)$$

which is O(1). □

[Figure 5: the words Y and Z with their covers a, b and c, d, and the results of
steps (1)-(4): prefix(d), prefix$_1$, prefix$_2$, prefix$_3$.]

Fig. 5. Computing prefix(X) given the covers of Y and Z.



Theorem 4. Given a 0.25-balanced SLP of size $O(n \log \frac{N}{n})$ and a pattern s[1..m]
we can detect an occurrence of s in the represented text in time $O(n \log \frac{N}{n} + m)$.

Proof. By Lemma 10 and Lemma 13 we compute the covers of all nonterminals
which represent subwords of s in time $O(n \log \frac{N}{n} + m)$. For the remaining
nonterminals X we use Lemma 14 to compute prefix(X) and suffix(X) in total linear
time considering the nonterminals in bottom-up order. Then due to Lemma 9
if there is an occurrence of s, there is an occurrence in prefix(Y) suffix(Z) for
some production $X \to YZ$. We consider every nonterminal X, either look up the
already computed prefix(Y) and suffix(Z) or compute them using the known
covers and Lemma 14, and use Lemma 6 to detect a possible occurrence. □

Theorem 5. Given a (potentially self-referential) Lempel-Ziv parse of size n
describing a text t[1..N] and a pattern s[1..m], we can detect an occurrence of
s inside t deterministically in time $O(n \log \frac{N}{n} + m)$.



7 Conclusions 

Recall that in order to guarantee an $O(n \log \frac{N}{n} + m)$ running time, it was necessary
to use integer division in the proof of Lemma [H]. This was the only such place,
though. If we assume that integer division is not allowed, and the only operations
on the integers $start_i$, $len_i$ appearing in the input triples are addition,
subtraction, multiplication and comparing with 0 (which are the only operations used
by the $O(n \log N + m)$ version of our algorithm), we can prove a matching lower
bound by looking at the corresponding algebraic computation trees. More
precisely, using standard tools [2] one can show that the depth of such a tree which
recognizes the set of integers $t, x_1, x_2, \ldots, x_n$ such that for all i it holds that
$x_i = (2\alpha_i + 1)t + \beta_i$ with $0 \le \beta_i < t$ and $0 \le \alpha_i < N$ is $\Omega(n \log N)$. On the other
hand, one can construct a self-referential LZ of constant size deriving (1*0*)^.
Hence one can also construct a LZ of size O(n) deriving (1*0*)-% 1 ...b n l where
bi = [^-J mod 2. This string does not contain 11 as a substring iff all $x_i$ are of
the form $x_i = (2\alpha_i + 1)t + \beta_i$ and the lower bound follows.